POLI 572B

Michael Weaver

March 1, 2021

Multivariate Least Squares

Objectives

Recap

  • mathematical properties of multivariate least squares

Interpretation

  • Best Linear Approximation of CEF
  • Dummy Variables
  • Slopes and Intercepts

Today

  • the least squares algorithm and its mathematical properties
    • what is the algorithm doing?
    • what are the mathematical facts about the algorithm?
  • how the algorithm relates back to the conditional expectation function
  • interpreting simple scenarios:
    • dummy variables
    • slopes and intercepts

Recap

Multivariate Least Squares:

Matrix equation fits mean of \(Y\) as a linear function of many variables:

\[Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki}\]

\(\hat{Y_i}\) is the predicted mean of \(Y\) given values of \(\mathbf{X}_i\)

Least Squares

We obtain the coefficient vector \(\beta = (b_0, b_1, \dots, b_k)'\):

\[\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

This is the matrix formula for least squares regression.

If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)

If \(X\) contains a column of \(1\)s and a column of values of \(X_1\), \(\beta\) contains an intercept and a slope on \(X_1\).

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

\[\mathbf{X}\beta = \begin{pmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \hat{Y}\]
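As a sketch, we can carry out this matrix computation directly in R on simulated data (variable names here are illustrative) and confirm it matches what lm() returns:

```r
# Least squares via the matrix formula (X'X)^{-1} X'Y, on simulated data
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(n)

X    <- cbind(1, x1, x2)                    # design matrix: intercept column + predictors
beta <- solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^{-1} X'Y

beta                     # matches the coefficients from lm()
coef(lm(y ~ x1 + x2))
```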

Mathematical Requirements:

  1. Matrix \(X\) has “full rank”
  • This means that all of the columns of \(X\) are linearly independent:
    • cannot have two identical columns
    • cannot have any column equal to a linear combination of the other columns (e.g., a set of columns that sums to a scalar multiple of another column)
  • If \(X\) is not full rank, \(\mathbf{X}'\mathbf{X}\) cannot be inverted, and we cannot do least squares.
  2. \(n \geq p\): we need at least as many data points as variables in our equation
  • no longer trivial with multiple regression
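A quick illustration of the full-rank requirement (simulated data):

```r
# A rank-deficient design matrix cannot be inverted
set.seed(1)
x <- rnorm(10)
X <- cbind(1, x, 2 * x)   # third column is a scalar multiple of the second
qr(X)$rank                # 2, not 3: X is not full rank
# solve(t(X) %*% X) would throw an error here: the matrix is singular
```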

Key facts about regression:

The mathematical procedures we use in regression ensure that:

  1. The mean of the residuals is always zero: \(\overline{e} = 0\). Because we included an intercept (\(a\) or \(b_0\)), the regression line goes through the point of averages. (The same holds for deviations from the mean.)
  2. \(Cov(X, e) = 0\). This is true by construction: it follows from how we derived least squares.
  3. \(\overline{Y} = \overline{X}'\beta\): the predicted value \(\hat{Y}\) when all variables in \(X\) are at their means is the mean of \(Y\).
  4. The slope \(b_k\) captures the change in \(Y\) due to variation in \(X_k\) that is orthogonal to (uncorrelated with) all other columns of \(X\).
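These facts are easy to verify numerically; a minimal sketch in R with simulated data (names illustrative):

```r
# Checking facts 1-3 on a simulated bivariate regression
set.seed(2)
x   <- rnorm(50)
y   <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
e   <- resid(fit)

mean(e)                                 # fact 1: zero (up to rounding)
cov(x, e)                               # fact 2: zero (up to rounding)
predict(fit, data.frame(x = mean(x)))   # fact 3: equals mean(y)
mean(y)
```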

Multivariate Least Squares:

\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]

\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)

where \(X_k^* = X_k - \hat{X}_k\), with \(\hat{X}_k\) obtained from the regression of \(X_k\) on all of the other columns of \(X\):

\(X_k = c_0 + c_1 X_1 + \ldots + c_j X_j\), over all \(j \neq k\)

\(X_k^*\) is the residual from regressing \(X_k\) on all other \(X_{j \neq k}\)
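This partialling-out result (the Frisch-Waugh theorem) can be checked directly; a sketch with simulated data:

```r
# The multivariate slope on x2 equals Cov(x2*, y) / Var(x2*),
# where x2* is the residual of x2 after regressing it on x1
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)        # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

x2_star <- resid(lm(x2 ~ x1))    # part of x2 orthogonal to x1
b2_fwl  <- cov(x2_star, y) / var(x2_star)

b2_fwl
coef(lm(y ~ x1 + x2))["x2"]      # same number
```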

Multivariate Least Squares:

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\), where \(X_k^*\) is the residual of \(X_k\) after regressing it on all other \(X_{j \neq k}\)?

  • The slope \(b_k\) is the change in \(Y\) with a one-unit change in the part of \(X_k\) that is uncorrelated with/orthogonal to the other variables in the regression.

The CEF

The CEF

\[E(Y_i | X_i) = f(X)\]

Conditional: mean of \(Y\) is different at different levels of \(X\)

Expectation: expected value/mean of \(Y\)

Function: mathematical function links values of \(X\) to mean of \(Y\). Could be, does not need to be, linear

Regression and CEF

How does multivariate regression relate back to Conditional Expectation Function?

  • Angrist and Pischke give justification
  • least squares is the best linear approximation of the conditional expectation function \(E(Y_i | X_i)\)


Least Squares and CEF

Least Squares predictions \(\widehat{Y_i} = X_i'\beta\) gives the linear approximation of \(E(Y_i | X_i)\) that has the smallest mean-squared error (minimum distance to the true \(E(Y_i | X_i)\)).

  • Least squares can also fit a line for the conditional mean of \(Y\) directly, not just a prediction for each \(Y_i\): we could use \(E(Y_i | X_i)\) itself as the dependent variable.

Exercise: Download the data. Contains mean income for each group of people with the same AGE in years

https://www.dropbox.com/s/gyvv0mgvfic9jv7/acs_grouped.rda?dl=0

Exercise:

  • Plot INCEARN by AGE

Use matrix calculations in R to obtain the linear approximation of \(E(Y_i | X_i)\):

  • What is the intercept? What does it mean?
  • What is the slope? What does it mean?
##           [,1]
## [1,] 17501.992
## [2,]  3668.629

Do your estimates agree? Why or why not?

We are not accounting for the number of people in each value of \(X_i\)


Least Squares and CEF

We can reproduce any individual-level least squares fit by:

  1. Collapsing the data into groups \(g \in G\), one for each unique value of \(X_i\): \(X_g\)
  2. Taking the group mean of \(Y_i\): \(Y_g\)
  3. Regressing \(Y_g\) on \(X_g\), weighting by group size \(n_g\)

We are then directly modelling the conditional expectation \(E(Y_i | X_i)\) as a linear function.
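This equivalence is easy to check; a sketch with simulated data (variable names are illustrative, not the ACS extract):

```r
# Grouped, size-weighted least squares reproduces the individual-level fit
set.seed(4)
age <- sample(20:60, 500, replace = TRUE)
inc <- 10000 + 800 * age + rnorm(500, sd = 5000)

ind_fit <- lm(inc ~ age)                      # individual-level regression

# collapse to group means and group sizes, then weight by group size
g <- aggregate(inc, by = list(age = age), FUN = mean)
names(g)[2] <- "inc_mean"
g$n <- as.numeric(table(age)[as.character(g$age)])
grp_fit <- lm(inc_mean ~ age, data = g, weights = n)

coef(ind_fit)
coef(grp_fit)   # identical coefficients
```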

Least Squares and CEF

Least Squares estimates the mean of \(Y\) as a function of values of \(X\):

  • the Conditional Expectation Function
  • \(E(Y | X = X_i)\) is the best prediction of \(Y_i\) given \(X_i\); least squares gives its best linear approximation \(\hat{Y}_i\)

Dummy Variables

Dummy Variables

What are they?

  • Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise.

  • Can be used for Yes/No categorical variables
  • Can be used for variables with many mutually exclusive categories (e.g. race, marital status, etc.)

How do we make them?

  • Easy to do in R with as.factor(): each unique value will get a dummy.

lm(y ~ as.factor(any_variable))

Dummy Variables

  1. What does the design matrix \(X\) in this regression look like?
  2. What does \(b_0\) tell us?
  3. What does \(b_1\) tell us?
  4. If we also include a MALE dummy, what does \(b_2\) tell us?

With an intercept and a FEMALE dummy:

##   (Intercept) FEMALE
## 1           1      0
## 2           1      0
## 3           1      0
## 4           1      1
## 5           1      0
## 6           1      1
## (Intercept)      FEMALE 
##   201941.70   -57269.02

Adding a MALE dummy alongside the intercept makes the columns linearly dependent (FEMALE and MALE sum to the intercept column), so R drops it:

##   (Intercept) FEMALE MALE
## 1           1      0    1
## 2           1      0    1
## 3           1      0    1
## 4           1      1    0
## 5           1      0    1
## 6           1      1    0
## (Intercept)      FEMALE        MALE 
##   201941.70   -57269.02          NA

Dropping the intercept instead, both dummies can be included, and each coefficient is its group's mean:

##   FEMALE MALE
## 1      0    1
## 2      0    1
## 3      0    1
## 4      1    0
## 5      0    1
## 6      1    0
##   FEMALE     MALE 
## 144672.7 201941.7

Dummy Variables

What do they do?

If there is an intercept

  • one category must be excluded (“reference group”); otherwise linear dependence. (R will do this by default)
  • coefficient is difference in means between the group indicated by the dummy and the intercept (“reference group” mean)

If there is no intercept:

  • all categories must be included
  • fits separate intercepts for each group: coefficient is the mean of the group indicated by the dummy (when all other variables are \(0\)).
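A toy illustration of both parameterizations (simulated data; names illustrative):

```r
# Dummy variables with and without an intercept
set.seed(5)
grp <- factor(rep(c("A", "B"), each = 25))
y   <- ifelse(grp == "A", 10, 15) + rnorm(50)

coef(lm(y ~ grp))       # intercept = mean of A; grpB = mean(B) - mean(A)
coef(lm(y ~ grp - 1))   # no intercept: grpA and grpB are the two group means
```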

Exercise:

How did different regions in the UK vote on Brexit?

Exercise:

Using the lm function:

  1. Regress Pct_Lev on Region
  2. Interpret the coefficients returned (summary(your_model))
  3. Regress Pct_Lev on -1 + Region
  4. Interpret the coefficients.
  5. Without looking at your_model$fitted.values, what can you say about the \(\widehat{Y}\) produced by these two regressions?

Example:

## 
##                     East            East Midlands                   London 
##                       47                       40                       33 
##               North East               North West         Northern Ireland 
##                       12                       39                        1 
##                 Scotland               South East               South West 
##                       32                       67                       37 
##                    Wales            West Midlands Yorkshire and The Humber 
##                       22                       30                       21

Dummy Variables

## 
## Call:
## lm(formula = Pct_Lev ~ RegionU, data = brexit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.8121  -4.7843   0.4257   5.1457  30.5685 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     39.13687    1.39400  28.075  < 2e-16 ***
## RegionULondon                   -0.04536    1.95643  -0.023    0.982    
## RegionUNorth West               16.77825    1.88088   8.920  < 2e-16 ***
## RegionUYorkshire and The Humber 19.51312    2.21458   8.811  < 2e-16 ***
## RegionUNorth East               20.34229    2.66931   7.621 2.15e-13 ***
## RegionUWest Midlands            21.17779    2.00400  10.568  < 2e-16 ***
## RegionUEast Midlands            20.43763    1.87025  10.928  < 2e-16 ***
## RegionUSouth West               14.54745    1.90365   7.642 1.87e-13 ***
## RegionUEast                     17.82525    1.80729   9.863  < 2e-16 ***
## RegionUSouth East               13.03312    1.69451   7.691 1.34e-13 ***
## RegionUWales                    14.21085    2.18398   6.507 2.50e-10 ***
## RegionUNorthern Ireland          5.08312    8.00793   0.635    0.526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.886 on 369 degrees of freedom
## Multiple R-squared:  0.443,  Adjusted R-squared:  0.4264 
## F-statistic: 26.68 on 11 and 369 DF,  p-value: < 2.2e-16

How is this different from what you produced?

Dummy Variables

During the Brexit vote, many areas thought to support “Remain” experienced heavy rainfall. total_precip_prepolling is the number of mm of rain that fell just before polls were open.

  1. Regress Pct_Lev on total_precip_prepolling + Region.
  2. Thinking about the least squares estimator: how is the slope on total_precip_prepolling calculated in this case? (think of \(X_k^*\))
  3. In light of \(2\), what does the slope on total_precip_prepolling mean?
## 
## Call:
## lm(formula = Pct_Lev ~ total_precip_prepolling + Region, data = brexit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.7877  -4.7351   0.4749   4.7649  28.7536 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     52.55803    1.49590  35.135  < 2e-16 ***
## total_precip_prepolling          0.05186    0.01165   4.451 1.13e-05 ***
## RegionEast Midlands              7.01178    1.92738   3.638 0.000314 ***
## RegionLondon                   -19.26625    1.77485 -10.855  < 2e-16 ***
## RegionNorth East                 6.92113    2.67736   2.585 0.010120 *  
## RegionNorth West                 3.35710    1.93773   1.732 0.084024 .  
## RegionNorthern Ireland          -8.33820    7.83610  -1.064 0.287993    
## RegionScotland                 -13.42789    2.02081  -6.645 1.10e-10 ***
## RegionSouth East                -3.11209    1.51142  -2.059 0.040193 *  
## RegionSouth West                 1.11001    1.95693   0.567 0.570911    
## RegionWales                      0.78957    2.21970   0.356 0.722260    
## RegionWest Midlands              7.75486    2.05162   3.780 0.000183 ***
## RegionYorkshire and The Humber   6.09083    2.24826   2.709 0.007061 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.692 on 368 degrees of freedom
## Multiple R-squared:  0.4715, Adjusted R-squared:  0.4542 
## F-statistic: 27.35 on 12 and 368 DF,  p-value: < 2.2e-16

Exercise:

\(2\): We regress Percent Leave on the residual pre-election precipitation.

The other variables are dummies for region, so the fitted value \(\widehat{precipitation}_i\) is the mean precipitation in the constituency's region.

The residual \(precipitation_i^*\) is therefore the difference between precipitation in a constituency and the mean precipitation in its region.

Dummy Variables

You can change the “reference group” like this. This code makes “Scotland” the reference by making it the first “level” in the factor (brexit here is a data.table):

region_levels = c('Scotland', 'London', 'North West', 'Yorkshire and The Humber', 'North East', 'West Midlands', 'East Midlands', 'South West', 'East', 'South East', 'Wales', 'Northern Ireland')
brexit[, RegionU := factor(Region, levels = region_levels)]
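Equivalently, in base R, relevel() moves one category to the front without spelling out every level (toy factor for illustration):

```r
# Changing the reference category with relevel()
region <- factor(c("London", "Scotland", "Wales", "Scotland"))
levels(region)[1]                           # default reference: first level alphabetically
region2 <- relevel(region, ref = "Scotland")
levels(region2)[1]                          # now "Scotland" is the reference
```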

Interpretation

Example

Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.

Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.

  • UHRSWORK (the usual hours worked per week)
  • FEMALE (an indicator for female)
  • AGE (in years)
  • LAW (an indicator for being a lawyer)
  • INCEARN (total annual income earned in dollars).

Example

Let’s now find the (approximate) linear conditional expectation function of earnings, across hours worked, age, profession, and gender:

Example

## 
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -304486  -90963  -29238   72845  743690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11083.5     9484.6   1.169    0.243    
## UHRSWORK      1075.5      114.7   9.373   <2e-16 ***
## AGE           3452.8      125.2  27.578   <2e-16 ***
## LAW         -48919.9     2674.6 -18.291   <2e-16 ***
## FEMALE      -37223.1     2799.8 -13.295   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127800 on 9995 degrees of freedom
## Multiple R-squared:  0.146,  Adjusted R-squared:  0.1457 
## F-statistic: 427.2 on 4 and 9995 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?


Exercise:

[Figure: linear approximation of the CEF of earnings by AGE]

Example

Previous slide shows linear approximation of CEF of earnings by age.

In your small groups:

  1. Can you think of a way to use least squares to better approximate the true conditional expectation function of earnings by Age?

  2. How would you do it?

  3. Try it out

  4. How would you incorporate this into the example above (lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data))?
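One possible approach, sketched here on simulated data (the ACS extract itself is not included): treat AGE as a categorical variable with factor(AGE), so least squares fits a separate mean at every age, which is exactly the CEF evaluated at each observed age.

```r
# factor(age) fits a separate mean at each age: the CEF itself
set.seed(6)
age <- sample(25:65, 1000, replace = TRUE)
inc <- 20000 + 1500 * age - 15 * age^2 + rnorm(1000, sd = 8000)  # nonlinear CEF

cef_fit <- lm(inc ~ factor(age))

# fitted values equal the within-age sample means
head(tapply(inc, age, mean), 3)
head(fitted(cef_fit)[order(age)], 3)
```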